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Abstract. Restricted Boltzmann machines (RBMs) are probabilistic graphical models that can 
be interpreted as stochastic neural networks. They have attracted much attention as building 
blocks for the multi-layer learning systems called deep belief networks, and variants and ex¬ 
tensions of RBMs have found application in a wide range of pattern recognition tasks. This 
tutorial introduces RBMs from the viewpoint of Markov random fields, starting with the re¬ 
quired concepts of undirected graphical models. Different learning algorithms for RBMs, in¬ 
cluding contrastive divergence learning and parallel tempering, are discussed. As sampling from 
RBMs, and therefore also most of their learning algorithms, are based on Markov chain Monte 
Carlo (MCMC) methods, an introduction to Markov chains and MCMC techniques is provided. 
Experiments demonstrate relevant aspects of RBM training. 
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1 Introduction 

In the last years, models extending or borrowing concepts from restricted Boltzmann machines (RBMs, 
[49]) have enjoyed much popularity for pattern analysis and generation, with applications including 
image classification, processing, and generation [25,51,34,28,48,31]; learning movement patterns [52, 
53]; collaborative filtering for movie recommendations [47]; extraction of semantic document represen¬ 
tations [46,17,60]; and acoustic modeling [40]. As the name implies, RBMs are a special case of general 
Boltzmann machines. The latter were introduced as bidirectionally connected networks of stochastic 
processing units, which can be interpreted as neural network models [1,22]. A Boltzmann machine is 
a parameterized model representing a probability distribution, and it can be used to learn important 
aspects of an unknown target distribution based on samples from this target distribution. These sam¬ 
ples, or observations, are referred to as the training data. Learning or training a Boltzmann machine 
means adjusting its parameters such that the probability distribution the machine represents fits the 
training data as well as possible. 

In general, learning a Boltzmann machine is computationally demanding. However, the learning 
problem can be simplified by imposing restrictions on the network topology, which leads us to RBMs, 
the topic of this tutorial. In Boltzmann machines two types of units can be distinguished. They have 
visible neurons and potentially hidden neurons. Restricted Boltzmann machines always have both types 
of units, and these can be thought of as being arranged in two layers, see Fig. 1 for an illustration. The 
visible units constitute the first layer and correspond to the components of an observation (e.g., one 
visible unit for each pixel of a digital input image). The hidden units model dependencies between the 
components of observations (e.g., dependencies between the pixels in the images) and can be viewed 
as non-linear feature detectors [22]. In the RBMs network graph, each neuron is connected to all the 
neurons in the other layer. However, there are no connections between neurons in the same layer, and 
this restriction gives the RBM its name. 

* The final version of this manuscript has been published as: Asja Fischer and Christian Igel. Training Re¬ 
stricted Boltzmann Machines: An Introduction. Pattern Recognition 47:25-39, 2014 



2 


learning 


generating 



hidden units 



■'I 


Q 


q 

1 


j 




training data 


hidden units 



samples 


Fig. 1 . Left: Learning an RBM corresponds to fitting its parameters such that the distribution represented by 
the RBM models the distribution underlying the training data, here handwritten digits. Right: After learning, 
the trained RBM can be used to generate samples from the learned distribution. 


Now, what is learning RBMs good for? After successful learning, an RBM provides a closed-form 
representation of the distribution underlying the training data. It is a generative model that allows 
sampling from the learned distribution (e.g., to generate image textures [34, 28]), in particular from the 
marginal distributions of interest, see right plot of Fig. 1. For example, we can fix some visible units 
corresponding to a partial observation (i.e., we set the corresponding visible variables to the observed 
values and treat them as constants) and sample the remaining visible units to complete the observation, 
for example, to solve an image inpainting task [28,51], see Fig. 7 in Sec. 7. In this way, RBMs can 
also be used as classifiers: The RBM is trained to model the joint probability distribution of inputs 
(explanatory variables) and the corresponding labels (response/output variables), both represented 
by the visible units of the RBM. This is illustrated in the left plot of Fig. 2. Afterwards, a new 
input pattern can be clamped to the corresponding visible variables and the label can be predicted by 
sampling, as shown in the right plot of Fig. 2). 

Compared to the 1980s when RBMs were first introduced [49], they can now be applied to more 
interesting problems due to the increase in computational power and the development of new learning 
strategies [21]. Restricted Boltzmann machines have received a lot of attention recently after being 
proposed as the building blocks for the multi-layer learning architectures called deep belief networks 
(DBNs, [25,23]). The basic idea underlying these deep architectures is that the hidden neurons of a 
trained RBM represent relevant features of the observations, and that these features can serve as input 
for another RBM, see Fig. 3 for an illustration. By stacking RBMs in this way, one can learn features 
from features in the hope of arriving at a high-level representation. 

It is an important property that single as well as stacked RBMs can be reinterpreted as deterministic 
feed-forward neural networks. When viewed as neural networks they are used as functions mapping the 
observations to the expectations of the latent variables in the top layer. These can be interpreted as the 
learned features, which can, for example, serve as inputs for a supervised learning system. Furthermore, 
the neural network corresponding to a trained RBM or DBN can be augmented by an output layer 
where the additional units represent labels (e.g., corresponding to classes) of the observations. Then 
we have a standard neural network for classification or regression that can be further trained by 
standard supervised learning algorithms [43]. It has been argued that this initialization (or unsupervised 
pretraining) of the feed-forward neural network weights based on a generative model helps to overcome 
some of the problems that have been observed when training multi-layer neural networks [25]. 

Boltzmann machines can be regarded as probabilistic graphical models, namely undirected graph¬ 
ical models also known as Markov random fields (MRFs) [29]. The embedding into the framework of 
probabilistic graphical models provides immediate access to a wealth of theoretical results and well- 



































3 


learning with labels 


classification 


hidden units 



training data 


hidden units 



sampling 


visible units 

t I 



image without 
label 


label 


Fig. 2. Left: RBM trained on labeled data, here images of handwritten digits combined with ten binary 
indicator variables, one of which is set to 1 indicating that the image shows a particular digit while the others 
are set to 0. Right: The label corresponding to an input image is obtained by fixing the visible variables 
corresponding to the image and then sampling the remaining visible variables corresponding to the labels from 
the (marginalized) joined probability distribution of images and labels modeled by the RBM. 


developed algorithms. Therefore, we introduce RBMs from this perspective after providing the required 
background on MRFs. This approach and the coverage of more recent learning algorithms and theoret¬ 
ical results distinguishes this tutorial from others. Section 2 will provide the introduction to MRFs and 
unsupervised MRF learning. Training of RBMs (i.e., the fitting of the parameters) is usually based 
on gradient-based maximization of the likelihood of the RBM parameters given the training data, 
that is, the probability that the distribution modeled by the RBM generated the data. Computing 
the likelihood of an undirected graphical model or its gradient is in general computationally intensive, 
and this also holds for RBMs. Thus, sampling-based methods are employed to approximate the like¬ 
lihood gradient. Sampling from an undirected graphical model is in general not straightforward, but 
for RBMs, Markov chain Monte Carlo (MCMC) methods are easily applicable in the form of Gibbs 
sampling. These methods will be presented along with the basic concepts of Markov chain theory in 
Sec. 3. Then RBMs will be formally described in Sec. 4, and the application of MCMC algorithms to 
RBMs training will be the topic of Sec. 5. Finally, we will discuss RBMs with real-valued variables 
before concluding with some experiments. 


2 Graphical models 

Probabilistic graphical models describe probability distributions by mapping conditional dependence 
and independence properties between random variables using a graph structure. Two sets of random 
variables X and Y are conditionally independent given a set of random variables Z if for all values 
x, y, z of the variables, we have p(x, y\z)= p(x \ z)p(y \ z), which implies p{x \ y,z) = p(x \ z ) and 
p{y \ x,z) = p(y | z). Visualization by graphs can help to develop, understand, and motivate proba¬ 
bilistic models. Furthermore, complex computations (e.g., marginalization) can be derived efficiently 
by using algorithms exploiting the graph structure. 

There exist graphical models associated with different kinds of graph structures, for example, factor 
graphs , Bayesian networks associated with directed graphs, and Markov random fields , which are also 
called Markov networks or undirected graphical models. This tutorial focuses on the last. A general 
introduction to graphical models for machine learning can, for example, be found in [5]. The most 
comprehensive resource on graphical models is the textbook by Roller and Friedman [29]. 
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Fig. 3. The trained RBM can be used as a feature extractor. An input pattern is clamped to the visible 
neurons. The conditional probabilities of the hidden neurons to be 1 are interpreted as a new representation 
of the input. This new representation can serve as input to another RBM or to a different learning system. 


2.1 Undirected graphs and Markov random fields 

First, we will summarize some fundamental concepts from graph theory. An undirected graph is an 
ordered pair G = (V,E), where V is a finite set of nodes and E is a set of undirected edges. An edge 
consists of a pair of nodes from V. If there exists an edge between two nodes v and w, i.e., {u, w} £ E, 
ui belongs to the neighborhood of v and vice versa. The neighborhood Af v = {w £ V : {w, v } £ E} of 
v is defined by the set of nodes connected to v. An example of an undirected graph can be seen in 
Fig. 4. Here the neighborhood of node V4 is {v2, V5, U 6 j-. 



Fig. 4. An example of an undirected graph. The nodes vi and vs are separated by V = {v 4 ,vs}. 


A clique is a subset of V in which all nodes are pairwise connected. A clique is called maximal if no 
node can be added such that the resulting set is still a clique. In the undirected graph in Fig. 4, both 
{'Ci, V 2 } and {fi, U 2 , U 3 } are cliques but only the latter is maximal. In the following, we will denote by 
C the set of all maximal cliques of an undirected graph. We call a sequence of nodes iq, V 2 , ■ ■ ■, v m £ V, 
with {vi, Vi+ 1 } £ E for i = 1,..., m — 1 a path from v\ to v m . A set V CV separates two nodes v £ V 
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and w £ V if every path from v to w contains a node from V. For an illustration of this concept, see 
Fig. 4). 

Given an undirected graph G = (V, E), we now associate, to each node v G V, a random variable 
X v taking values in a state space A v . To ease the notation, we assume A v = A for all v G V. The 
set of random variables X = (X v ) v& v is called a Markov random field (MRF) if the joint probability 
distribution p fulfills the (local) Markov property w.r.t. the graph. This property is fulfilled if for all 
v G V the random variable X v is conditionally independent of all other variables given its neighborhood 
( X w )weM v ■ That is, for all v G V and all x G A^ v \, one has that p(x v | (x u ,) U)€ y\{„}) = p(x v \ {x w ) we jy v ). 

There exist two other types of Markov properties, which are equivalent to the local Markov property 
if the probability distribution of the MRF is strictly positive. The MRF is said to have the global Markov 
property if for any three disjunct subsets A, B, S C V, such that all nodes in A and B are separated 
by S , the variables (X a ) a&J \ and are conditionally independent given (X s ) sg< 5 , i.e., for all 

x G A\ v 1 one has that p (( x a ) a eA I {xt)teSue) = p((x a )a£A I (xt)tes)- The pairwise Markov property 
means that any two non-adjacent variables are conditionally independent given all other variables: if 
{v,w} £ E, then p(x v ,x w \ (x t ) te v\{v,w}) = p{x v \ ( xt)tev\{v,w})p(x w | (xt)tev\{v,w}) for all x G yl |y| . 

Since conditional independence of random variables and the factorization properties of the joint 
probability distribution are closely related, one can ask if there exists a general factorization of MRF 
distributions. An answer to this question is given by the Hammersley-Clifford Theorem (for rigorous 
formulations and proofs we refer to [32,29]): 

Theorem 1. A strictly positive distribution p satisfies the Markov property w.r.t. an undirected graph 
G if and only if p factorizes over G. 

A distribution is said to factorize over an undirected graph G with maximal cliques C if there exists 
a set of non-negative functions {V , c}ccC> called potential functions, satisfying 

Mx,x G A |y| : (x c ) ce c = (x c ) ceC => tpc{x) = ipc{x) (1) 

and 

P(x) = ^ n ^c{x). (2) 

^ cec 

The normalization constant Z = Y) x Ylcec V’ c( x c ) is called the partition function. 

If p is strictly positive, the same holds for the potential functions. Thus we can write 

p(x) = \ n ^c(xc) = M C (* C ) = I e -^W , (3) 

^ cec 

where we call E = X^ceC 'fic(xc) the energy function. Thus, the probability distribution of every 
MRF can (if it is strictly positive) be expressed in the form given by (3), which is also called the Gibbs 
distribution. 

2.2 Unsupervised MRF learning 

Unsupervised learning means learning (important aspects of) an unknown distribution q based on 
sample data. This includes finding new representations of the data that foster learning, generalization, 
and communication. If we assume that the structure of the graphical model is known and that the 
energy function belongs to a known family of functions parameterized by 6, unsupervised learning of 
a data distribution with an MRF means adjusting the parameters 6. We write p(x \ 6) when we want 
to emphasize the dependency of a distribution on its parameters. 

We consider training data S = {x\, ..., X(}. The data samples are assumed to be independent and 
identically distributed (i.i.d.). That is, they are drawn independently from some unknown distribution 
q. A standard way of estimating the parameters of a statistical model is maximum-likelihood estimation. 
Applied to MRFs, this corresponds to finding the MRF parameters that maximize the probability of 



S under the MRF distribution, i.e., training corresponds to finding the parameters 6 that maximize 
the likelihood given the training data. The likelihood C : 0 —> R of an MRF given the data set S maps 

i 

parameters Q from a parameter space 0 to C{6 \ S) = J~[ p(xi \ 6). Maximizing the likelihood is the 

i =1 

same as maximizing the log-likelihood given by 

i l 

ln£(0 | S) = In | 6) = Y l M x i | 0) . (4) 

»= 1 2 = 1 

For the Gibbs distribution of an MRF, it is in general not possible to find the maximum likelihood 
parameters analytically. Thus, numerical approximations have to be used, for example gradient ascent, 
which is described below. 

Maximizing the likelihood corresponds to minimizing the distance between the unknown distribu¬ 
tion q underlying S and the distribution p of the MRF in terms of the Kullback-Leibler divergence 
(KL divergence), which for a finite state space fl is given by 

KL(<jr||p) = Y q {&) ln = Y q (*) ln ~ q W ln P( x ) ■ (5) 

The KL divergence is a (non-symmetric) measure of the difference between two distributions. It is 
always positive, and it is zero if and only if the distributions are the same. As becomes clear by 
equation (5), the KL divergence can be expressed as the difference between the entropy of q and a 
second term. Only the latter depends on the parameters subject to optimization. Approximating the 
expectation over q in this term by the training samples from q results in the log-likelihood. Therefore, 
maximizing the log-likelihood corresponds to minimizing the KL divergence. 


Optimization by gradient ascent. If it is not possible to find parameters maximizing the likelihood 
analytically, the usual way to find them is by gradient ascent on the log-likelihood. This corresponds 
to iteratively updating the parameters d^' 1 to based on the gradient of the log-likelihood. Let 

us consider the following update rule: 

Q(t+1) = g(t) + V _A_ ( ln £(0(t) I _ X 0(t) + vAd^-V ( 6 ) 

---v--' 

= A6 (t) 

If the constants A € Rq" and v £ Rjj" are set to zero, we have vanilla gradient ascent. The constant 
77 G R + is the learning rate. As we will see later, it can be desirable to strive for models with weights 
having small absolute values. To achieve this, we can optimize an objective function consisting of the 
log-likelihood minus half of the norm of the parameters ||0|| 2 /2 weighted by A. This method is called 
weight decay, and penalizes weights with large magnitude. It leads to the term in our update 

rule ( 6 ). In a Bayesian framework, weight decay can be interpreted as assuming a zero-mean Gaussian 
prior on the parameters. The update rule can be further extended by a momentum term, Ad 
weighted by the parameter u. Using a momentum term helps against oscillations in the iterative update 
procedure and can speed up the learning process, as is seen in feed-forward neural network training 

[43]. 


Introducing latent variables. Suppose we want to model an m-dimensional unknown probability 
distribution q (e.g., each component of a sample corresponds to one of to pixels of an image). Typically, 
not all the variables X = (X v ) ve y in an MRF need to correspond to some observed component, 
and the number of nodes is larger than m. We split X into visible (or observed) variables V = 
(Vi,..., V m ) corresponding to the components of the observations and latent (or hidden ) variables H — 
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(Hi,... ,H n ) given by the remaining n = \V\ — m variables. Using latent variables allows describing 
complex distributions over the visible variables by means of simple (conditional) distributions. In this 
case, the Gibbs distribution of an MRF describes the joint probability distribution of (V, H) and one 
is usually interested in the marginal distribution of V, which is given by 

P( v ) = = ^J2 e ~ E(V ’ h) ’ ( 7 ) 

h h 

where Z = Y v h e ~ E ( v E) While the visible variables correspond to the components of an observation, 
the latent variables introduce dependencies between the visible variables (e.g., between the pixels of 
an input image). 


Log-likelihood gradient of MRFs with latent variables. Restricted Boltzmann machines are 
MRFs with hidden variables and RBM learning algorithms are based on gradient ascent on the log- 
likelihood. For a model of the form (7) with parameters 0 , the log-likelihood given a single training 
example v is 


In C(0 | v) = In p(v | 0) = In - ^ e ~ E ( v ’ h) = In ^ e ~ E{ - v ' h) - In ^ e~ E(v 


h) 


( 8 ) 


and for the gradient we get 

= m ('■“ E ~ 4 (>”£ 


00 80 


v.h 


l 


1 \ ' -E(v.h) 9E(v, h) _ ± \ ' 

^ e -E(^,h)Z^ e QQ + w e~E(v,h) 1 


-E(v,h) dE(v,h) 

do 


dO e ~ E(v '‘ 

v,h “ ,,v 

. . dE(v, h) . dE(v, h) 

= -^2p(h\v) —^ + Z>(«- h) 

h v.h 


(9) 


In the last step we used that the conditional probability can be written 


p(h | v) 


P(v,h) 

p(v) 


—E(v,h) e ~E(v,h) 

1 J2 e~ E ( v ’ h ) ~ e~ E( - v ' h '> 

h h 


( 10 ) 


Note that the last expression of (9) is the difference between two expectations: the expected values of 
the energy function under the model distribution and under the conditional distribution of the hidden 
variables given the training example. Directly calculating this sums, which run over all values of the 
respective variables, leads to a computational complexity which is in general exponential in the number 
of variables of the MRF. To avoid this computational burden, the expectations can be approximated 
by samples drawn from the corresponding distributions based on MCMC techniques. 


3 Markov chains and Markov chain Monte Carlo techniques 

Markov chains play an important role in RBM training because they provide a method to draw 
samples from “complicated” probability distributions such as the Gibbs distribution of an MRF. This 
section will serve as an introduction to some fundamental concepts of Markov chain theory. A detailed 
introduction can be found, for example, in [7] and the aforementioned textbooks [5,29]. An emphasis 
will be put on Gibbs sampling as an MCMC technique often used for MRF training and in particular 
for training RBMs. 



3.1 Markov chains 


A Markov chain is a time discrete stochastic process, where the next state of the system depends only 
on the current state and not on the sequence of events that preceded it. Formally, a Markov chain is 
a family of random variables X = { X^ \ k £ No} taking values in a (in the following considerations, 
finite) set ft and for which Vfc > 0 and V), i, Zo, ... , ik-i £ 12 one has 

= Pr (X( fc +P = j | X™ = i, X^-V = 4-i, • ■ •, * (0) = z 0 ) = Pr = j \ X « = z) . (11) 

The “memorylessness” of a stochastic process expressed by (11) is also referred to as Markov property 
(considering temporal neighborhood, while the Markov properties discussed in Sec. 2.1 considered 
neighborhood induced by the graph topology). If for all points in time k > 0 the p[ k ' > have the same 
value pij (i.e., the transition probabilities do not change over time), the chain is called homogeneous 
and the matrix P = (Pij)ijen is called the transition matrix of the homogeneous Markov chain. 

If the starting distribution (i. e ., the probability distribution of AV°)) is given by the probability 
vector /z ( °i = (p(°\i)) ie n, with /i®(z) = Pr(A^ 0 i = z), the distribution of X^ is given by 
^( k ) T = M (0) T pfe 

A distribution n for which 7 t t = 7r T P is called a stationary distribution. If the Markov chain at 
time k has reached the stationary distribution = 7T, then all subsequent states will be distributed 
accordingly, that is, p( k+n ') = -k for all n £ N. A sufficient (but not necessary) condition for a distri¬ 
bution 7r to be stationary w.r.t. a Markov chain described by the transition probabilities Pij,i,j £ ft 
is that Vz, j £ ft 

n{i) Pij = n(j)pji . (12) 

This is called the detailed balance condition. 

Especially relevant are Markov chains for which there exists a unique stationary distribution. For 
a finite state space ft, this is the case if the Markov chain is irreducible. A Markov chain is irreducible 
if one can get from any state in ft to any other in a finite number of transitions or, more formally, 
Vz, j £ ft 3k > 0 with Pr(X( fe ) = j | A'® = z) > 0. 

A chain is called aperiodic if every state can reoccur at irregular times. Formally, a chain is aperiodic 
if for all i £ ft the greatest common divisor of all elements in the set {k £ No | Pr(X^ fc ^ = z | = 

z) > 0} is 1. One can show that an irreducible and aperiodic Markov chain on a finite state space 
is guaranteed to converge to its stationary distribution. Let for two distributions a and /3 on a finite 
state space ft the distance of variation be defined as 

d v (a,/3) = ^\cx-f3\ = ^^2\a(x)~ p{x)\ . (13) 

To ease the notation, we allow both row and column probability vectors as arguments of the functions 
in (13). Then we have: 

Theorem 2. Let tv be the stationary distribution of an irreducible and aperiodic Markov chain on a 
finite state space with transition matrix P. For an arbitrary starting distribution p, 

lim dv(p T P k , tv t ) = 0 . (14) 

k—too 


For a proof see, for instance, [7]. 

Markov chain Monte Carlo methods make use of this convergence theorem for producing samples 
from a probability distribution by setting up a Markov chain that converges to the desired distributions. 
Suppose you want to sample from a distribution q with a finite state space. Then you construct an 
irreducible and aperiodic Markov chain with stationary distribution tv = q. This is a non-trivial task. 
If k is large enough, the state x 1 ^ of X^ from the constructed chain is then approximately a sample 
from tv and therefore from q. Gibbs sampling [18] is such a MCMC method and will be described in 
the following section. 
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3.2 Gibbs sampling 


Gibbs sampling is a simple MCMC algorithm for producing samples from the joint probability dis¬ 
tribution of multiple random variables. The basic idea is to construct a Markov chain by updating 
each variable based on its conditional distribution given the state of the others. In the following, 
we will describe this procedure by explaining how Gibbs sampling can be used to produce samples 
(approximately) from the Gibbs distribution of an MRF. 

We consider an MRF X = (Xi,... ,Xn) w.r.t. an undirected graph G = (V,E), where V = 
(1, ...,1V} for the sake of clearness of notation. The random variables X,. i £ V take values in 
a finite set A and tt(x) = ^e~ £ ^ is the joint probability distribution of X. Furthermore, if we 
assume that the MRF changes its state over time, we can consider X = {X^ \ k € No} as a Markov 
chain taking values in 17 = A N . Then X^ = (x[ k \...,X^) describes the state of the MRF at 
time k > 0. Between two successive points in time, the new state of the chain is produced by the 
following procedure. First, a variable 7Q, i £ V is randomly picked with a probability q(i) given by a 
strictly positive probability distribution q on V. Then, the new state for X, is sampled based on its 
conditional probability distribution given the state (x v ) ve v\i of all other variables (X v ) ve v\i- We have 
7T (Xi | {x v ) l ,ev\i) = n ( x i I ( x w)w&Xi) because of the local Markov property of MRFs (cf. Sec. 2.1). The 
transition probability p xy for two states x , y of the MRF X with x ^ y is 


Pxy — 


?(0 7r (Vi I ( x v) v ev\i) , 

0 , 


if 3i £V so that \/v £ V with v ^ i: x v = y v 
else . 


(15) 


And the probability, that the state of the MRF x stays the same, is p xx = q(i) 7r ( x i | (x v ) t >gy\i). 

iev 


Convergence of the Gibbs chain. To show that the Markov chain defined by these transition 
probabilities (the so called Gibbs chain) converges to the joint distribution tt of the MRF, we have to 
prove that tt is the stationary distribution of the Gibbs chain and that the chain is irreducible and 
aperiodic (see Theorem 2). 

It is easy to see that tt is the stationary distribution by showing that the detailed balance condition 
(12) holds: for x = y this follows directly. If x and y differ in the value of more than one random 
variable, then this follows from the fact that p yx = p xy = 0. Assume now that x and y differ only in 
the state of exactly one variable A}, i.e., yj = Xj for j ^ i and y, ^ x*. Then 


n(x)p xy = 7r(x)q(i)7r (y» | ( x v ) veV \i ) = 7r (x iy (x v ) veV \i) q(i) 


^ ( Vi 5 (,%v)v€V\i) 
TT ((^i;)u £V\i) 


= tt (; yi , (x v ) v&v \i) q(i) n ^ = n{y)q(i)TT (x r \ (. x v ) veV \i ) = n{y)p yx . (16) 

( \Xy ) V £V\i ) 


Thus, the detailed balance condition is fulfilled and tt is the stationary distribution. 

Since tt is strictly positive, so are the conditional probability distributions of the single variables. 
This means that every single variable X, can take every state x% € A in a single transition step and thus 
every state of the whole MRF can reach any other in A N in a finite number of steps, so the Markov 
chain is irreducible. Furthermore, it follows from the positivity of the conditional distributions that 
Pxx > 0 for all x £ A N , and thus that the Markov chain is aperiodic. Aperiodicity and irreducibility 
guarantee that the chain converges to the stationary distribution tt. 

In practice, the single random variables to be updated are usually not chosen at random based 
on a distribution q, but in a fixed predefined order. The corresponding algorithm is often referred to 
as the periodic Gibbs sampler. If P is the transition matrix of the Gibbs chain, the convergence rate 
of the periodic Gibbs sampler to the stationary distribution of the MRF is bounded by the following 
inequality (see for example [7]): 

\fiP k - tt| < * - 7r|(l - e~ NA ) k , 


(17) 
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where A = sup igV Si and Si = sup{|£(a;) — £(y)\;xi = y^i £ V with i ^ /}. Here y is an arbitrary- 
starting distribution and \\y — 7r| is the distance in variation as defined in (13). 


Gibbs sampling and Metropolis-Hastings algorithms. Gibbs sampling belongs to the broader 
class of Metropolis-Hastings algorithms [19], see [42] for a good overview. All MCMC algorithms of 
this class generate the transitions of a Markov chain in two substeps. In the first substep, a candidate 
state is picked at random from a so called proposal distribution. In the second substep, the candidate 
state is accepted as the new state of the Markov chain with an acceptance probability ensuring that 
detailed balance holds. The proposal distribution of Gibbs sampling always suggests to flip the current 
state of a single random variable and accepts this with the conditional probability of the suggested 
state given the states of the remaining random variables. 

For sampling in Ising models, the same proposal distribution (“flip the state”) has been combined 
with the acceptance probability min(l, y^y), where x denotes the current and x' the new state of the 
Markov chain. As discussed in [42], this sampling algorithm may be advantageous over Gibbs sampling. 
Recently, it has been shown that this indeed also holds true for RBMs [8]. 

4 Restricted Boltzmann machines 

An RBM (also denoted as a Harmonium [49]) is an MRF associated with a bipartite undirected graph as 
shown in Fig. 5. It consists of m visible units V = (Vi,..., V m ) representing the observable data, and n 
hidden units H = (Hi ,..., H n ) to capture the dependencies between the observed variables. In binary 
RBMs, our focus in this tutorial, the random variables (V, H) take values (v, h) £ {0, l} m + n and the 
joint probability distribution under the model is given by the Gibbs distribution p(v, h ) = ^e~ E ( v ’ h ^ 
with the energy function 


E(v, = w H h i v i ~ b i v i ~ Cihi ■ ( 18 ) 

i= 1 j =1 j =1 i =1 

For alH £ {1,..., n} and j £ {1,..., m}, Wij is a real valued weight associated with the edge between 
the units Vj and Hi, and bj and c, are real valued bias terms associated with the jth visible and the 
zth hidden variable, respectively. 



Fig. 5. The network graph of an RBM with n hidden and m visible units. 


The graph of an RBM has connections only between the layer of hidden and the layer of visible 
variables, but not between two variables of the same layer. In terms of probability, this means that the 
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hidden variables are independent given the state of the visible variables and vice versa: 

n m 

p(h | v) = n p(h-i | v) and p(v \ h) = l[p(v j \h) . (19) 

*= 1 3 =1 

Thus, due to the absence of connections between hidden variables, the conditional distributions p(h \ v) 
and p(v | h ) factorize nicely, and simple expressions for the factors will be given in Sec. 4.1. 

The conditional independence between the variables in the same layer makes Gibbs sampling espe¬ 
cially easy: instead of sampling new values for all variables subsequently, the states of all variables in 
one layer can be sampled jointly. Thus, Gibbs sampling can be performed in just two steps: sampling 
a new state h for the hidden neurons based on p(h \ v) and sampling a state v for the visible layer 
based on p(v \ h). This is also referred to as block Gibbs sampling. 

Now, how does the RBM distribution over V (e.g., the space of images) look like? The marginal 
distribution (7) of the visible variables becomes 


E b 3 


P(v) = h )=yY, e E(V ' h] = 7 E E ' • • E eJ_1 IT 


hi (c{+ J2 w ij v j) 


h i h<2 h n 


1 


i < j j \—> 

= z 6 ' -1 E' 


i( c i+E Wijvj) h 2 (c 2 + E wijVj) h n (c n + J2 w nj Vj) 


E 


E- 


hx 


m / m \ m 

1 T.bjVj™ hi[ci+Yi WijVj) 1 ™ -A-/ Ci+Y. WijVj 

~z e ‘" IIE e =zH e R 1 + e 

i—1 hi j —1 i—1 ^ 


( 20 ) 


This eciuation shows why a (marginalized) RBM can be regarded as a product of experts model [21, 
58], in which a number of “experts” for the individual components of the observations are combined 
multiplicatively. 

Any distribution on {0, l} m can be modeled arbitrarily well by an RBM with m visible and k + 1 
hidden units, where k denotes the cardinality of the support set of the target distribution, that is, the 
number of input elements from {0, l} m that have a non-zero probability of being observed [33]. It has 
been shown recently that even fewer units can be sufficient, depending on the patterns in the support 
set [41]. 


4.1 RBMs and neural networks 

The RBM can be interpreted as a stochastic neural network, where the nodes and edges correspond to 
neurons and synaptic connections, respectively. The conditional probability of a single variable being 
one can be interpreted as the firing rate of a (stochastic) neuron with sigmoid activation function 
sig(x) = 1/(1 + e ~ x ), because 


p{Hi 


1 I v) = sig f W iJ V 3 + °i ) 
V j=i * 


and 

( n 

p{Vj = 11 h) = sig I '^2 WijK + bj 
To see this, let v_i denote the state of all visible units except the Ith one and let us define 

n 

ai{h) = -Ywqhj - bi 

i =1 



( 21 ) 


( 22 ) 


( 23 ) 
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and 

n m m n 

P(v-I,h) = E w ij h i v i~ E bjVj-^Cihi . (24) 

*=i j= i,j#i * =1 

Then E(v, h) = 0(v-i, h) +viai(h), where vioq(h) collects all terms involving vi and we can write [2]: 

p(Vi = l\h)= p{Vi = 11 v-i,h) = P ^ Vl 

P\ v -U h) 

e ~E(vi=l,v_i,h) e —f}(v-i,h)—l-ai(h) 

g—E(vi=l,v-i,h) _j_ g— E(vi=0,v-i,h) e —P(v-i,h) — l-ai(h) _|_ g—/3(u_i,h.) —0-a; (h) 
e —/3(v-i,h) . e —ai(h) e ~/3(v-i,h) _ e ~ai(h) 

e -p(v-i,h) . e -ai(h) _|_ e -/9(v_i,h) g -/3 (v_ l ,h) . ( e ~ai(h ) _|_ j) 

= e -ai(h) + Y = r ( + ! = 1 + e a,(h) = Si g(~^(fe)) = sig + bi ) (25) 

As mentioned in the introduction, an RBM can be reinterpreted as a standard feed-forward neural 
network with one layer of nonlinear processing units. From this perspective, the RBM is viewed as a 
deterministic function {0, l} m —> K™ that maps an input v £ {0, l} m to y £ R" with yi = p(Hi = 11 v). 
That is, an observation is mapped to the expected value of the hidden neurons given the observation. 

4.2 The gradient of the log-likelihood 

As shown in Sec. 2.2, the log-likelihood gradient of an MRF can be written as the sum of two expec¬ 
tations, see (9). For RBMs the first term of (9) (i.e., the expectation of the energy gradient under the 
conditional distribution of the hidden variables given a training sample v) can be computed efficiently 
because it factorizes nicely. For example, w.r.t. the parameter u\j we get: 


p(h k | v)hiVj = EE p(hi | v)p(h^i | v)hiVj 

h ^ h h k —1 hi h-i 


E 


( m 

E 

j=i 


U ij Vj — I - a 


(26) 


Since the second term in (9) (i.e., the expectation of the energy gradient under the RBM dis¬ 
tribution) can also be written as J 2 p( v ) Hp(h 1 v ) or Y 1 pWY 1 p( v I fe) dE d 0 h ^ > we can also 

i > h h v 

reduce its computational complexity by applying the same kind of factorization to the inner sum, 
either factorizing over the hidden variables as shown above or factorizing over the visible variables in 
an analogous way. However, the computation remains intractable for regular sized RBMs because its 
complexity is still exponential in the size of the smallest layer (the outer sum still runs over either 2 m 
or 2” states). 

Using the factorization trick (26) the derivative of the log-likelihood of a single training pattern v 
w.r.t. the weight w t j becomes 


d\nC( 6 \v) dE(v,h) 

= ~ 2 _^p( h K 


E^’fc) 


dE(v, h) 


d Wij 1 ' d Wij ft ' dWij 


Ep(^i«)Mj -E p( v ) ^^, P(h I v)fkVj — p(Ht — 1| v)vj ^ \ p( v )p(Hj — lju)i 


(27) 


V 


V 
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For the mean of this derivative over a training set S = {iq,... ,v?} often the following notations 
are used: 


1 dlnC(6 | v) 1 y- 
P • J r)m - P f ^ 


dwi i 


•ues L 


-E. 


•p(h | v) 


dE(v, h) 
diva 


■E. 


'p(h,v) 


dE(v,h) 


dwi 


— p y] [®p(h | v) [ v ihj] Ep(fc,t») [Vj/lj]] 

i 

= ( v ihj)p(h I v)q(v) ~ ( v ihj)p(h,v) 

with q denoting the empirical (or data) distribution. This gives the often stated rule: 

dln£(0 \v) , u \ , , \ 

-bT- « ( W ^j)data - m /model 

i>es OWi i 


(28) 


(29) 


Analogously to (27) we get the derivatives w.r.t. the bias parameter bj of the jtlr visible variable 

<91n£(0 | v) 


dbj 


= v 3 -J2 p ^ 1 


and w.r.t. the bias parameter c,; of the ith hidden variable 

91n C(9 | v) 


dci 


= p{Hi = l|v) -^p(u)p(i7, ; = l|w) 


(30) 


(31) 


To avoid the exponential complexity of summing over all values of the visible variables (or all 
values of the hidden if one decides to factorize over the visible variables beforehand) when calculating 
the second term of the log-likelihood gradient—or the second terms of (27), (30), and (31)—one 
can approximate this expectation by samples from the model distribution. These samples can, for 
example, be obtained by Gibbs sampling. This requires running the Markov chain “long enough” to 
ensure convergence to stationarity. Since the computational costs of such an MCMC approach are still 
too large to yield an efficient learning algorithm, common RBM learning techniques, as described in 
the following section, introduce additional approximations. 


5 Approximating the RBM log-likelihood gradient 

All common training algorithms for RBMs approximate the log-likelihood gradient given some data 
and perform gradient ascent on these approximations. Selected learning algorithms will be described 
in the following section, starting with contrastive divergence learning. 


5.1 Contrastive divergence 

Obtaining unbiased estimates of the log-likelihood gradient using MCMC methods typically requires 
many sampling steps. However, it has been shown that estimates obtained after running the chain for 
just a few steps can be sufficient for model training [21]. This leads to contrastive divergence (CD) 
learning, which has become a standard way to train RBMs [21,4, 24,3, 23]. 

The idea of fc-step contrastive divergence learning (CD-fc) is quite simple: instead of approximating 
the second term in the log-likelihood gradient by a sample from the RBM-distribution (which would 
require running a Markov chain until the stationary distribution is reached), a Gibbs chain is run 
for only k steps (and usually k = 1). The Gibbs chain is initialized with a training example of 
the training set and yields the sample v^ after k steps. Each step t consists of sampling from 
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p(h\v (*)) and subsequently sampling from p(v\h ^). The gradient, see (9), w.r.t. 0 of the 

log-likelihood for one training pattern is then approximated by 


cD fc (0y°>) 


^p(ft|u (0) ) 


dE(y(°\h) 

00 


+ 5 Z^I' u(fe) ) 

h 


dE(vW,h) 

dO 


(32) 


The derivatives in the direction of each single parameter are obtained by “estimating” the expectations 
over p(v) in (27), (30), and (31) by the single sample v^ k \ A batch version of CD-fc can be seen in 
Algorithm 1. In batch learning , the complete training data set S is used to compute or approximate 
the gradient in every step. However, it can be more efficient to consider only a subset S' C S in every 
iteration, which reduces the computational burden between parameter updates. The subset S' is called 
a mini-batch. If in every step only a single element of the training set is used to estimate the gradient, 
the process is often referred to as online learning. 


Algorithm 1: k -step contrastive divergence 


1 

2 

3 

4 

5 

6 

7 

8 


Input: RBM (Vi,..., V m , Hi ,..., H n ), training batch S 

Output: gradient approximation Awtj , Abj and Aci for i = 1,... ,n, j = 1, 

init Aw^ = Abj = Aa = 0 for i = 1,..., n, j = 1,..., m 


forall the v £ S do 

I v < °' ) <— v 


for t — 0* ., k — 1 do 

for i = 1,..., n do sample h ~ p(hi \ v^) 

1 

for j = 1,... ,m do sample nj t+1 ' ) ~ p(vj \ hfb) 


9 

for 

i = 

1,. 

.., n, j = 1,..., m do 


10 

L 

Aw. 

ij 

- Awij + p(Hi = 1 Vj 0) - 

-p(Hi = l|«W).t;W 

11 

for 

j = 

U- 

.. ,m do 


12 

L 

Ab 3 

<r~ 

Abj + ti( 0) — r( fc) 


13 

for 

i = 

1,. 

.., n do 
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L 

Aa 

•*— 

Aa +p(Hi = 1 u t0) ) - p(Hi = 

= l|u w ) 


.., m 


Usually the stationary distribution is not reached after k sampling steps. Thus, v^ is not a sample 
from the model distribution and therefore the approximation (32) is biased. Obviously, the bias vanishes 
as k —>• oo. 

The theoretical results from [3] give a good understanding of the CD approximation and the corre¬ 
sponding bias by showing that the log-likelihood gradient can, based on a Markov chain, be expressed 
as a sum of terms containing the fcth sample: 

Theorem 1 (Bengio and Delalleau [3]) For a converging Gibbs chain 

«(°) =*h<°> W 1 ) 

starting at data point , the log-likelihood gradient can be written as 


d_ 

do 


lnp(i/ 


) 


J2p( h \ v 


( 0 ) 


d E{v(°\h) 


dO 






T hip( v (k) ju(o)) 


<91np(ib fe )) 

dO 


(33) 
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and the final term converges to zero as k goes to infinity. 

The first two terms in equation (33) just correspond to the expectation of the CD approximation 
(under pk) and the bias is given by the final term. 

The approximation error not only depends on the number k of sampling steps, but also on the rate 
of convergence or the mixing rate of the Gibbs chain. The rate describes how fast the Markov chain 
approaches the stationary distribution and is determined by the transition probabilities of the chain. 
The mixing rate of the Gibbs chain of an RBM depends on the magnitude of the model parameters [21, 
9,3,14]. This becomes clear by considering that the transition probabilities, that is, the conditional 

n m 

probabilities p(vj \ h ) and p(hi \ v), are given by thresholding Y Wijhi + bj and Y w ij v j + °i by the 

2=1 1 = 1 

sigmoid function. If the absolute values of the parameters are high, the conditional probabilities can 
get close to one or zero. If this happens, the states of the Gibbs chain get more and more “predictable”, 
and thus the chain changes its state slowly. An empirical analysis of the dependency between the size 
of the bias and magnitude of the parameters can be found in [3]. 

An upper bound on the expectation of the CD approximation error under the empirical distribution 
is given by the following theorem [14]: 


Theorem 2 (Fischer and Igel [14]) Let p denote the marginal distribution of the visible units of 
an RBM and let q be the empirical distribution defined by a set of samples V\, .., v^. Then an upper 
bound on the expectation of the error of the CD-k approximation of the log-likelihood derivative w.r.t 
some RBM parameter 9 a is given by 


E. 




Ep( v (k) | v (0)) 


91np(-i/ fc )) 

dfa 


< \\q~P I (l-e-( m+ "^)' 


(34) 


with 


where 


and 


A = max < max di , max A 

/£{!,...,m} /£{!,...,n} 


di = max ■ 


fi = max • 


y~i i{wu >o }Wu + b L 


i =1 


E I{wu< o} w u + h 


i -1 


y ' i{wij>o} w ij+°i 

1=1 


E J { U>y<0}Wlj +Cl 
1=1 


The bound (and probably also the bias) depends on the absolute values of the RBM parameters, on 
the size of the RBM (the number of variables in the graph), and on the distance in variation between 
the modeled distribution and the starting distribution of the Gibbs chain. 

As a consequence of the approximation error, CD learning does not necessarily lead to a maximum 
likelihood estimate of the model parameters. Yuille [62] specifies conditions under which CD learning is 
guaranteed to converge to the maximum likelihood solution, which need not hold for RBM training in 
general. Examples of energy functions and Markov chains for which CD-I learning does not converge 
are given in [37]. The empirical comparisons of the CD approximation and the true gradient for RBMs 
(small enough that the gradient is still tractable) conducted in [9] and [3] show that the bias can lead 
to a convergence to parameters that do not reach the maximum likelihood. 

The bias, however, can also lead to a distortion of the learning process: after some learning iter¬ 
ations the likelihood can start to diverge (see the experiments in Sec. 7) in the sense that the model 
systematically gets worse if k is not large [13]. This is especially bad because the log-likelihood is 
not tractable in reasonably sized RBMs, and so the misbehavior can not be displayed and used as a 
stopping criterion. Because the effect depends on the magnitude of the weights, weight decay can help 



16 


to prevent it. However, the weight decay parameter A, see equation (6), is difficult to tune. If it is 
too small, the weight decay has no effect. If it is too large, the learning converges to models with low 
likelihood [13] (see Fig. 8 in Sec. 7). 

More recently proposed learning algorithms try to obtain better approximations of the log-likelihood 
gradient by sampling from a Markov chain with a greater mixing rate. 


5.2 Persistent contrastive divergence 

The idea of persistent contrastive divergence (PCD) [55] is described in [61] for log-likelihood maxi¬ 
mization of general MRFs and is applied to RBMs in [55]. The PCD approximation is obtained from 
the CD approximation (32) by replacing the sample v^ by a sample from a Gibbs chain that is in¬ 
dependent of the sample of the training distribution. The algorithm corresponds to standard CD 
learning without reinitializing the visible units of the Markov chain with a training sample each time 
we want to draw a sample v^ approximately from the RBM distribution. Instead one keeps “persis¬ 
tent” chains which are run for k Gibbs steps after each parameter update (i.e., the initial state of the 
current Gibbs chain is equal to v^ from the previous update step). The fundamental idea underlying 
PCD is that one could assume that the chains stay close to the stationary distribution if the learning 
rate is sufficiently small and thus the model changes only slightly between parameter updates [61, 55]. 
The number of persistent chains used for sampling (or the number of samples used to approximate the 
second term of gradient (9)) is a hyper parameter of the algorithm. In the canonical form, there exists 
one Markov chain per training example in a batch. 

The PCD algorithm was further refined in a variant called fast persistent contrastive divergence 
(FPCD) [56]. Fast PCD tries to reach a faster mixing of the Gibbs chain by introducing additional 
parameters wfj,bj,c( (for i = 1 ,...,n and j = 1 ,...,m) referred to as the fast parameters. This 
new set of parameters is only used for sampling and not in the model itself. When calculating the 
conditional distributions for Gibbs sampling, the regular parameters are replaced by the sum of the 
regular and the fast parameters, i.e., Gibbs sampling is based on the probabilities p(Hi = 11 v) = 

sig ^ X] ( w ij + w lj) v o + ( c » + 4)^ and P( v j = 1 1 h ) = si S ^ J2 ( w ij + w Ij)hi + (bj + frf)^ instead of 

the conditional probabilities given by (21) and (22). The learning update rule for the fast parameters 
is the same as the one for the regular parameters, but with an independent, large learning rate leading 
to faster changes as well as a large weight decay parameter. Weight decay can also be used for the 
regular parameters, but it has been suggested that regularizing just the fast weights is sufficient [56]. 

Neither PCD nor FPCD seem to increase the mixing rate (or decrease the bias of the approximation) 
sufficiently to avoid the divergence problem, as can be seen in the empirical analysis in [13]. 


5.3 Parallel tempering 

One of the most promising sampling techniques used for RBM training so far is parallel tempering 
(PT) [44,12,10]. It introduces supplementary Gibbs chains that sample from more and more smoothed 
replicas of the original distribution. This can be formalized in the following way. Given an ordered set 
of M temperatures 1 = Tj < < ■ ■ • < Tm, we define a set of M Markov chains with stationary 

distributions 

Pr{Vjh) = _L e -^ E ^ h) (35) 

for r = 1,..., M, where Z r = h e~ 7T; r E ^ v,h ^ is the corresponding partition function, and pi is exactly 
the model distribution. With increasing temperature the probability mass of the Gibbs distribution 
(35) gets more and more equally distributed (or smoother), and thus the mixing of the corresponding 
Markov chain gets more and more facilitated. As T —> oo, the uniform distribution is reached, in 
which subsequent samples of the Gibbs chain are independent from each other and thus the stationary 
distribution is reached after just one sampling step. 



17 


Algorithm 2: /c-step parallel tempering with M temperatures 


for r = 1,..., M do 


for i = 1,..., n do sample ~ p(h r ,i | 


for t = 0,..., k — 1 do 

for j = 1,..., m do sample ~ p(t 




Input: RBM (Vi,..., V m , H i,..., H n ), training batch S , current state v r of Markov chain with 
stationary distribution p r for r = 1,..., M 
Output: gradient approximation Awtj , Abj and Ad for i = 1,..., n, j = 1,..., m 

1 init Aw^ = Abj = Ad = 0 for i = 1,..., n, j = 1,..., m 

2 forall the v G S do 

3 

4 

5 

6 

7 

8 
9 

10 
11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 


for i = 1,..., n do sample h ^ t ~ 1 ^ ~ p(h r ,i \ 4 t+1 ^) 

, (*:) 

<— Vr 

/* swapping order below works well in practice [36] 

for r € {s\2 < s < M and s mod 2 = 0} do 
t swap , hi k ■*) and h^2i) with probability given by (37) 

for r £ {s | 3 < s < M and s mod 2 = 1} do 

swap (v k ,hr) and (v k _ 1 , with probability given by (37) 

for i = 1,..., n, j = L,..., m do 
j_ <- Air, ; + p(H t = 11 v)-Vj - p(Hi = 11 v[ k) )-v[ k ] 

for j = 1 ,,m do 
|_ <r- Abj + Vj - V^j 

for i = 1,..., n do 

| Aa ■<— An + p(Hi = 11 v) — p(Hi = 1 \ ) 


*/ 


In each step of the algorithm, we run k (usually A; = 1) Gibbs sampling steps in each tempered 
Markov chain yielding samples (i?i, hi),..., («m, ^m)- After this, two neighboring Gibbs chains with 
temperatures T r and T r _i may exchange particles (v r ,h r ) and [v r _i,h r _ i) with an exchange proba¬ 
bility based on the Metropolis ratio, 


min 


1 Pr (v r — 1, h r ^y2)p r — 1 (v r , hr) ) 

p r (v r ,h r )p r —\(y r —\,h r —\) J 


which gives, for RBMs, 


min 



1 


T r ~ i 


• ( E{v r , h r ) — E{v r _i 



(36) 


(37) 


After performing these swaps between chains, which increase the mixing rate, we take the (eventually 
exchanged) sample Vi of the original chain (with temperature T) = 1) as a sample from the RBM 
distribution. This procedure is repeated L times, yielding the samples ..., vi t L used for the 
approximation of the expectation under the RBM distribution in the log-likelihood gradient (i.e., 
for the approximation of the second term in (9)). Usually L is set to the number of samples in the 
(mini) batch of training data as shown in algorithm 2. 

Recently, several approaches to improving PT for RBMs have been suggested. In [11] it is shown 
how the number M of parallel chains and the values of the temperatures used can be adapted auto¬ 
matically. Other work has been focused on increasing the swapping rate by allowing samples to swap 
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not only between neighboring chains [6], and on using all samples (not only those of the first chain) to 
approximate the gradient by weighted averages [6,15]. 

Compared to CD, PT introduces computational overhead, but produces a more quickly mixing 
Markov chain, and thus a less biased gradient approximation. This can lead to better learning, as 
shown in the experiments in Sec. 7. 


6 RBMs with real-valued variables 


So far, we have only considered observations represented by binary vectors, but often one would like 
to model distributions over continuous data. There are several ways to define RBMs with real-valued 
visible units. As demonstrated by [24], one can model a continuous distribution with a binary RBM by 
a simple “trick.” The input data is scaled to the interval [0,1] and modeled by the probability of the 
visible variables to be one. That is, instead of sampling binary values, the expectation p(Vj = 11 h) is 
regarded as the current state of the variable Vj . Except for the continuous values of the visible variables 
and the resulting changes in the sampling procedure, the learning process remains the same. 

When keeping the energy function as given in (18) and just replacing the state space {0, l} m 
of V by [0, l] m , the conditional distributions of the visible variables belong to the class of truncated 
exponential distributions. This can be shown in the same way as the sigmoid function for binary RBMs 
is derived in (25). Using the same notation and writing the energy as E(v, h ) = f3(v_i,h) + Viai(h), 
we have 


p(vi | h) = p(vi | v_ u h) 


_ p(vi,v-i,h) 
p{v-i,h) 

e -E(vi,v- l ,h)j [0 ' i] 


f e- E <- v ‘’ v - l ’ h )I [0A )(vi)dvi 


e -P(v-,,h) ■ e -*iaiW I[0il] ( Vl ) 
f e -0(v-i,h) . e -«i«!(k)/ [01] ( Vi )dn ; 

e-^ i(h) / [0 ,i]fa) 

/ e~ viai ( h) I\op\ fa)<fa 


(38) 


where the characteristic function /[o,i]fa) is 1 if Vi £ [0,1] and 0 otherwise. That (38) is a truncated 
exponential with parameter ai(h ) can be seen from dropping the restriction to the interval [0,1] , 
which yields f ™ e e -!,^Z dvi = ai{h)e~ via ^ h \ 

Widely used are Gaussian-Binary-RBMs where the visible variables given the state of the binary 
hidden units are normally distributed. Visible neurons with a Gaussian distributed conditionals are 
gained (in an analogous way to (38)) by augmenting the energy with quadratic terms, which can be 
written as 


E(v, h) 


n m 


i=1 j=l 3 



fa' ~frj) 2 


n 

c fa 5 

i —1 


(39) 


see [10]. 

Making use of the factorization as in (20), the partition function of the Gaussian-Binary-RBMs 
can be written as 


x , /• >: >: -•y/o j >: 1 J . 2 j' ->: 

z = J2 ei=lj=1 j i=1 j i=1 dv 

h 

=E<fa'“n /< 


A , v } (»,■ -6<r 
; l=1 i i d Vj 


3 = 1 ' 


=x> 5 ‘*n ^ 

h j=l 


2b j w ij h i + (J2i Wij hi) 


( 40 ) 
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This yields a tractable expression when the number of hidden variables is small enough (e.g., to 
visualize the log-likelihood in experiments such as shown in Sec. 7). 

In contrast to the universal approximation capabilities of standard RBMs on {0, l} m , the subset of 
real-valued distributions that can be modeled by an RBM with real-valued visible and binary hidden 
units is rather constrained [57]. However, if we add an additional layer of binary latent variables, we 
can model any strictly positive density over a compact domain with arbitrary high accuracy [30]. 

More generally, it is possible to cover continuous valued variables by extending the definition of an 
RBM to any MRF whose energy function satisfies p(h \ v ) = J][-p(/ij I v ) and P( v I h) = YljP(vj \ h ). 
As follows directly from the Hammersley-Clifford theorem, and as also discussed in [24], this holds for 
any energy function of the form 

E(v,h) = +'Y^u j {v j ) +'Y^v i {h i ) (41) 

i,j j i 

with real-valued functions u>j, and z/,; ,i = 1,..., n and j = 1,..., to, fulfilling the constraint that 

the partition function Z is finite. Welling et al. [59] come to almost the same generalized form of the 
energy function in their framework for constructing exponential family harmoniums from arbitrary 
marginal distributions p(vj) and p{hi) from the exponential family. 

7 Experiments 

This section will present experiments illustrating the behavior of RBMs in practice. After an introduc¬ 
tion to the experimental setup, we will consider sampling from a trained RBM solving an inpainting 
task. Then, an experiment will show the difference between CD learning with and without “Rao- 
Blackwellization” [50]. After that, typical learning curves will be shown for different parameter settings 
of CD and PT. Finally, we will look at the receptive fields learned by RBM hidden units. 

All experiments in this section can be reproduced using the open source machine learning library 
Shark [27], which implements most of the models and algorithms that were discussed. 

7.1 Experimental setup 

The experiments were performed on two benchmark problems taken from the literature. As a toy prob¬ 
lem we considered a 4 x 4 variant of the Bars-and-Stripes (BAS) benchmark [38, 26]. Each observation 
corresponds to a square of 4 x 4 units. The data set is generated by randomly choosing for each pattern 
an orientation (vertical or horizontal) with equal probability first, and then picking the state for all 
units of every row or column uniformly at random. Thus, the data set consists out of 32 patterns, six 
of which can be seen in the top of Fig. 6. Furthermore, we considered the MNIST handwritten digit 
recognition benchmark [35]. The training set consists out of 60000 samples of digits, see the bottom 
of Fig. 6 for examples. Each image consists out of 28 x 28 grayscale pixels, which are binarized with a 
threshold value of 127. 

For training, the RBMs were initialized with small random weights and zero bias parameters. In 
the BAS experiments, the number of hidden units was set to 16. If not stated otherwise, 20 hidden 
units were used for modeling the MINIST data. The models were trained using gradient ascent on 
CD-fc with k G {1, 2,4,10,100} or PT with M G {4, 5,10, 50}. The temperatures used in the parallel 
chains were chosen such that the inverse temperatures were equally distributed over [0,1] (which may 
not be the optimal choice [11]). If not stated otherwise, the update rule used for gradient ascent was 
equal to the one resulting from (6) by replacing the log-likelihood by either the mean of the CD- 
or the PT-approximation over the training batch. For BAS, standard batch learning was used, while 
for MNIST the training data was split into mini-batches of 100 and 600 samples in the experiments 
in Sec. 7.5 and Sec. 7.4, respectively. The learning rate was p = 0.1 for BAS when training with 
CD and 0.05 when training with PT and p = 0.3 for CD-learning on MNIST. To keep the number of 
hyperparameters small, we did not use a momentum term (y = 0). In our experience, using momentum 
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Fig. 6. Top: Patterns from the BAS data set. Bottom: Images from the MNIST data set. 


does not qualitatively change the results of the reported experiments. If not stated otherwise, no weight 
decay term was used (A = 0). The experiments in Sec. 7.3 and Sec. 7.4 were repeated 25 times. 


7.2 Application example: Reconstruction of images 

As outlined in the introduction, a trained RBM can be used as a generative model of the targe 
distribution. That is, we can use it to sample from p(V). If the visible units correspond to pixels from an 
image, this means that we can generate images from (an approximation of) the target distribution. Now 
consider the case where we observe only a part of the image, say V\ = Vi, ..., V a = v a , o < to. Then we 
can use the trained RBM for image inpainting by sampling from p(V 0 +i ,..., V m | V\ = Vi ,..., V 0 = v 0 )- 
To do so, we clamp the observed variables to the observation, that is, we fix Vi = v ±,..., V 0 = v 0 , and 
sample the states of the remaining variables, for example using Gibbs sampling. 

This is demonstrated in Fig. 7. We trained an RBM on BAS. At different stages of the training 
procedure, we clamped the first column of the input image to a certain pattern and sampled from the 
RBM (all other states were initialized with 0.5 and Gibbs sampling was employed). In the beginning, 
when the RBM distribution was almost uniform, the samples did not resemble the target distribution, 
as shown in the first row of Fig. 7. After some training, the samples started to reveal the structure 
of the BAS distribution. As the model distribution got close to the training distribution, the RBM 
successfully reconstructed the BAS pattern in a few sampling steps, as can be seen in the third row of 
Fig. 7. 

This toy experiment indicates how an RBM can be used for classification as illustrated in Fig. 2. 
The RBM can be trained on the joint probability distributed of the data and the corresponding labels. 
After training, a new image is clamped to the corresponding units (i.e., the corresponding visible 
units are fixed to the pixel values), and the label units are sampled. Another possibility of course is 
to use the RBM weights to initialize a feed-forward neural network augmented with an output layer 
corresponding to the labels, which can then be fine tuned in a supervised way for classification. 


7.3 To sample or not to sample? 

Algorithm 1 differs in a small detail from the description of the CD algorithm in [2], To approximate 
the first sum of the log-likelihood gradient we use the probabilities p(Hi = 11 th°)), * = 1,..., n, exactly 
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Fig. 7. Reconstruction of an incomplete image by sampling from an RBM. Each row shows the initialization 
of the visible units and the first four samples from the Gibbs chain. First row: After one iteration of batch 
learning (log-likelihood 356,154). Second row: After 1500 iterations of batch learning (log-likelihood 210,882). 
Third row: After 4000 iterations of batch learning (log-likelihood 154,618). 


(e.g., see lines 10 and 14 in Algorithm 1), while in [2] this quantity is approximated by the h\°' > from 
the Gibbs sampling procedure. 

To see the difference, RBMs were trained on BAS using the two different approaches and the 
log-likelihood was calculated in every iteration. The results are shown in Fig. 8. 




Fig. 8. Evolution of the log-likelihood during training of an RBM with CD-I where the first expectation is 
calculated directly (solid line) or approximated by a sample (dashed line). Left: Medians over 25 runs, error 
bars indicate quartiles. Right: Exemplary single runs. 
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Both procedures led to similar log-likelihood values, but using the expectation instead of the sample 
reduced the variance of both the log-likelihood values in a single trial (i.e., oscillations were reduced) 
as well as in between the different trials as indicated by the error bars in Fig. 8. This result was to 
be expected, the additional sampling noise increases the variance, while using the expectation reduces 
the variance of the estimator. The latter is clear from the Rao-Blackwell theorem as also mentioned in 
the recommended review by Swersky et al. [50]. 


7.4 Learning curves using CD and PT 

The following experiments shall give an idea about the evolution of the log-likelihood during RBM 
training. The considered RBMs have so few neurons that the log-likelihood is tractable and can be 
used to show the learning process over time. 




Fig. 9. Evolution of the log-likelihood during training of RBMs with CD-fc where different values for k were 
used. The left plot shows the results for BAS (from bottom to top k = 1,2, 5,10, 20, 100) and the right plot for 
MNIST (from bottom to top k = 1,2, 5,10, 20). The values are medians over 25 runs. 


The learning curves of CD-fc with different numbers of sampling steps fc are depicted in Fig. 9. 
Shown are the median values over 25 trials. As could be observed in the BAS example, for some 
learning problems, the log-likelihood steadily decreases after an initial increase if fc is not large enough. 
Thus, after some iterations of the learning process the model starts to get worse. This behavior can be 
explained by the increasing magnitude of the weights: as explained in Sec. 5.1, the mixing rate of the 
Gibbs chain decreases with increasing absolute values of the weights and thus the CD approximation 
gets more and more biased. As suggested by Theorem 1 and Theorem 2, the larger fc the less biased the 
approximation gets. Accordingly, the experiments show that the larger fc the less severe the divergence 
and the higher the maximum log-likelihood value reached during training. The effect of weight-decay 
with different weight-decay parameters A is shown in the left plot in Fig. 10. The results indicate 
that the choice of the A is crucial. If chosen correctly, the divergence problem was solved. But if the 
hyperparameter A was too large, learning stagnated on a low log-likelihood level and thus the RBMs 
did not model the target distribution accurately. And if the parameter was too small, the weight decay 
term could not prevent divergence. 

As can be seen in the right plot of Fig. 10, the performance of PT on BAS clearly depends on the 
number of Gibbs chains used in parallel. The more chains are used, the better the mixing, leading to 
better gradient approximations and thus better learning. Compared to CD-I, PT-1 (i.e., PT with fc = 1 
sampling step performed in every chain at each learning iteration) led to significant higher likelihood 
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values, but needed more computational power. However, PT-1 with M = 4 was comparable to CD 
with k = 10 in terms of model quality, but computationally less demanding. It seems that divergence 
problems can be prevented with PT if the number of parallel chains is not too small. 

Of course, the wallclock runtime of the learning algorithms strongly depends on the implementation. 
However, to give an idea about actual runtimes we will report some measurements. All results refer 
to the median of 5 trials. In our experiments, learning BAS with 1000 iterations of CD-I took 0.14 s. 
Changing to CD-10 increased the runtime to 0.8 s. Using PT with M = 5 instead required 0.67 s for 
the same number of iterations. Increasing the number of chains to M = 20 increased the runtime to 
2.18 s. Running CD-I on MNIST for 100 iterations over mini-batches with 600 samples took 84.67 s 
for RBMs having 500 hidden units. 



Fig. 10. Evolution of the log-likelihood during training of an RBM on BAS. Left plot: with CD-I and different 
values of the weight decay parameter (from bottom to top: A = 0.05,0,00005,0,0005). Right Plot: with PT 
with different numbers M of temperatures (from bottom to top M = 4,5,10,50). The values correspond to 
the medians over 25 runs. 


7.5 Hidden units as filters 

As mentioned in the introduction, the hidden units can be viewed as feature detectors. Let the visible 
units correspond to the pixels of an image. Then we can ask how observed images should look like in 
order to activate a certain hidden unit most strongly (i.e., to lead to a high probability of this unit 
being 1). To answer this question, we color-code the weights of the hidden unit and plot them arranged 
as the corresponding visible units. The resulting image visualizes the preferred input pattern, which is 
referred to as the learned filter or, in biological terms, the receptive field of the hidden neuron. 

Figure 11 shows the weights of RBMs with 16 and 100 hidden units trained on MNIST for 100 
epochs. Each square corresponds to the 784 weights of one hidden neuron to the 28 x 28 = 784 visible 
neurons. The squares are ordered according to the probabilities of the corresponding hidden units to 
be 1 given the training set in decreasing order. 

When only 16 hidden units were used, the learned filters are rather complex and it is even possible 
to recognize digits in them. When 100 hidden units were used, the receptive fields get more localized 
and show stroke-like features. 

3 The experiments were conducted on a computer with an Intel Core i3 CPU with 3.07GHz running Ubuntu 
3.2.0. We used the Shark library [27] and gcc 4.8.0, and the learning ran in a single thread. 
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Fig. 11. Visualization of the weights of RBMs with 16 and 100 hidden units (left and right plot, respectively) 
trained on MNIST. In the two plots, each square image shows the weights of one hidden unit. These images 
have the size of the input images, and each weight is visualized at the position of the corresponding visible 
unit. The gray values represent the size of the weights. 


8 Where to go from here? 

The previous sections have introduced the most basic RBM models. However, several generalizations 
and extensions of RBMs exist. A notable example are conditional RBMs (e.g., [54,39]). In these 
models, some of the parameters in the RBM energy are replaced by parametrized functions of some 
conditioning random variables, see [2] for an introduction. An obvious generalization is to remove the 
“R” from the RBM, which brings us back to the original Boltzmann machine [1]. The graph of a 
BM corresponds to the graph of an RBM with additional connections between the variables of one 
layer. These dependencies make sampling more complicated (in Gibbs sampling each variable has to 
be updated independently) and thus training more difficult. However, specialized learning algorithms 
for particular “deep” graph structures have been developed [45]. 

The goal of this article was to introduce RBMs from the probabilistic graphical model perspective. 
It is meant to supplement existing tutorials [2,50,20], and it is biased in the sense that it focuses on 
material that we have found helpful in our work. We hope that the reader is now equipped to move on 
to advanced models building on RBMs—in particular, to deep learning architectures, where [2] may 
serve as an excellent starting point. 
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